Implement tesseract backend #375

Merged
merged 14 commits into from
Nov 20, 2024

Collaborator

@jonchang jonchang commented Nov 7, 2024

Description

Implements the tesseract backend for OCR.

Testing notes

You can test this on the front-end by applying this patch:

diff --git a/OCR/ocr/api.py b/OCR/ocr/api.py
index 444c834..ec5603d 100644
--- a/OCR/ocr/api.py
+++ b/OCR/ocr/api.py
@@ -9,6 +9,7 @@ from fastapi import FastAPI, UploadFile, Form
 from fastapi.middleware.cors import CORSMiddleware
 
 from ocr.services.image_ocr import ImageOCR
+from ocr.services.tesseract_ocr import TesseractOCR
 from ocr.services.alignment import ImageAligner
 from ocr.services.image_segmenter import ImageSegmenter, segment_by_color_bounding_box
 
@@ -29,7 +30,7 @@ app.add_middleware(
 segmenter = ImageSegmenter(
     segmentation_function=segment_by_color_bounding_box,
 )
-ocr = ImageOCR()
+ocr = TesseractOCR()
 
 
 def data_uri_to_image(data_uri: str):
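
As an alternative to editing `api.py` by hand, the swap shown in the patch above could be driven by configuration. This is only a minimal sketch and not part of this PR: the `OCR_BACKEND` environment variable, the `BACKENDS` mapping, and `make_ocr` are hypothetical names, and the two classes are empty stand-ins for `ImageOCR` and `TesseractOCR` so the snippet runs on its own.

```python
import os


class ImageOCR:
    """Stand-in for ocr.services.image_ocr.ImageOCR (the existing backend)."""


class TesseractOCR:
    """Stand-in for ocr.services.tesseract_ocr.TesseractOCR (added in this PR)."""


# Hypothetical registry mapping a backend name to its class.
BACKENDS = {"default": ImageOCR, "tesseract": TesseractOCR}


def make_ocr(name=None):
    """Build an OCR backend from an explicit name or the OCR_BACKEND variable."""
    name = (name or os.environ.get("OCR_BACKEND", "default")).lower()
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(
            f"unknown OCR backend {name!r}; expected one of {sorted(BACKENDS)}"
        ) from None


# In api.py this would stand in for the hard-coded `ocr = ImageOCR()` line.
ocr = make_ocr()
```

With something like this in place, running the API with `OCR_BACKEND=tesseract` would exercise the new backend without a local patch.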

Related Issues

#321

Checklist

  • The title of this PR is descriptive and concise.
  • My changes follow the style guidelines of this project.
  • I have added or updated test cases to cover my changes.
  • I've let the team know about this PR by linking it in the review channel

@arinkulshi-skylight arinkulshi-skylight linked an issue Nov 12, 2024 that may be closed by this pull request
3 tasks
@jonchang jonchang marked this pull request as ready for review November 20, 2024 16:19
Collaborator Author

@derekadombek these are the Dockerfile-related changes.

Collaborator

Oh gotcha! Kinda what I was imagining. Makes sense. Like what we chatted about earlier, it shouldn't make much of a difference in build time. Now that we're adding this, though, do you know if we're able to eliminate other installed dependencies to make these images smaller, or will they still be needed?

Not sure if we'll be able to get this in by January, but it would be nice to scan these images for CVEs.

Collaborator Author

I'll be honest, I have no clue why ffmpeg and xlib are in there. I can look into it, though, if the image size is a problem. I also note that we don't clean up after apt update, which is also a concern.

Collaborator

@arinkulshi-skylight arinkulshi-skylight left a comment

LGTM. Let's create a new ticket to call the function in the API and test the entire flow.

# Nothing matched, just return the default path
return tesserocr.get_languages()[0]

def image_to_text(self, segments: dict[str, np.ndarray]) -> dict[str, tuple[str, float]]:
Collaborator

TODO: initialize the class and invoke the function in the API call.
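
For reference, the `image_to_text` signature quoted above takes a dict of labeled image segments and returns a dict of `(text, confidence)` pairs under the same labels. The sketch below illustrates that contract only and is not the code merged in this PR. The `recognize` parameter, the `Recognizer` alias, the handling of `None` segments, and `tesseract_recognize` are all assumptions made for illustration. The tesserocr-backed recognizer is imported lazily, so the rest of the snippet runs without tesserocr or Pillow installed.

```python
from typing import Any, Callable, Mapping

# Hypothetical type: a recognizer maps one image segment to (text, confidence).
Recognizer = Callable[[Any], tuple[str, float]]


def image_to_text(
    segments: Mapping[str, Any], recognize: Recognizer
) -> dict[str, tuple[str, float]]:
    """Run `recognize` on every labeled segment and collect the results."""
    results: dict[str, tuple[str, float]] = {}
    for label, image in segments.items():
        if image is None:
            # Assumption: a missing segment yields empty text, zero confidence.
            results[label] = ("", 0.0)
            continue
        text, confidence = recognize(image)
        results[label] = (text.strip(), float(confidence))
    return results


def tesseract_recognize(image: Any) -> tuple[str, float]:
    """One possible recognizer built on tesserocr (needs tesserocr and Pillow)."""
    import tesserocr
    from PIL import Image

    with tesserocr.PyTessBaseAPI() as api:
        # Assumes `image` is a NumPy array that Pillow can convert.
        api.SetImage(Image.fromarray(image))
        # MeanTextConf() reports an average confidence between 0 and 100.
        return api.GetUTF8Text(), float(api.MeanTextConf())
```

Under these assumptions, an API handler would call `image_to_text(segments, tesseract_recognize)` once per uploaded image and serialize the resulting dict.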

@jonchang jonchang added this pull request to the merge queue Nov 20, 2024
Merged via the queue into main with commit 88ffe5b Nov 20, 2024
2 checks passed
@jonchang jonchang deleted the tesseract-backend branch November 20, 2024 23:06
Development

Successfully merging this pull request may close these issues.

Implement Tesseract as an alternative model that can be used in backend
3 participants